Statistics

Basic genome statistics

The basic statistics are values calculated from the raw DNA sequence, not from genes. The results include:

*Total length (base pairs)
*Percentage AT
*Standard deviation of AT (in the case of multiple replicons/contigs)
*Number of replicons/contigs
*Percentage of unknown bases (not A, T, C or G)
*Fraction of the genome made up by the largest contig/replicon, as a percentage of total genome length. This measure is mostly useful for evaluating whether most of the genome is in one piece or whether it is heavily fragmented.

The program is called as follows:

    genomeStatistics .fna

Example output:

    Filename  TotalBases:  Per.AT:  StDevAT:  ContigCount:  Per.Unknowns:  Per.LargestSeq
    .fna      2132142      61.3707  0.0000    1             0.0000         100.0000

Unknown bases analysis

In some DNA sequences, bases other than A, T, C or G are found. This can be a result of assembly programs, where the distance between two sequences is known but the sequence itself is not. The analysis of these DNA signatures produces the following measures (example values shown):

*Total number of bases: 2209947
*Total number of unknown stretches: 99
*Total number of unknowns: 79605
*Percentage of unknowns: 3.60212258484027
*Average length of unknown stretch: 804.090909090909
*Max/min length of unknown stretch: 1780 / 141

The program is called as follows:

    countUnknowns.pl Megamonas_hypermegale_ART12_1.fna

Amino acid and codon usage

This system offers several ways of analyzing third-position base use, amino acid usage and codon usage. The first is a visual presentation, which should be used to present the patterns of a few genomes. This approach is not useful for comparing many genomes. The analysis takes as input the open reading frames predicted by a gene finder (DNA open reading frames, FASTA format):

    >NZ_ADFU01000001__CDS_1275-526
    ATGAAAAAATCCACTTTGCTTGCTTTCACAGCGGCAGTATTATTCGGCAGTGGCGT
    CACGTTAATGCGGCATCTGCTACATATGATGATCCATTGCTTTTACCAAATCCTGC
    GCGCCTACAACAGGTTCTGTTGTATTGGTTCCTGTGGCTAGCCCTCAGGCGGTGCA
    ............
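To make the codon usage calculation concrete, here is a minimal shell sketch of how per-codon counts and percentages could be derived from ORFs in the FASTA format above. The function name codon_usage and the awk logic are assumptions made for illustration, not the system's own implementation. The sketch assumes every ORF starts in frame, ignores trailing bases that do not fill a complete codon, and reports each percentage relative to the total number of codons across all ORFs.

```shell
# codon_usage: hypothetical helper, not part of the biotools scripts.
# Reads ORFs in FASTA format on stdin and prints, per codon:
#   codon <CODON> <percent of all codons> <count>
codon_usage() {
  awk '
    # Split the sequence collected so far into non-overlapping triplets,
    # starting at the first base (assumes the ORF is in frame).
    function flush_seq(   i, c) {
      for (i = 1; i + 2 <= length(seq); i += 3) {
        c = substr(seq, i, 3)
        count[c]++
        total++
      }
      seq = ""
    }
    # A header line ends the previous record.
    /^>/ { flush_seq(); next }
    # Sequence line: strip whitespace, uppercase, append to current record.
    { gsub(/[ \t\r]/, ""); seq = seq toupper($0) }
    END {
      flush_seq()
      for (c in count)
        printf "codon %s %.5f %d\n", c, 100 * count[c] / total, count[c]
    }
  ' | LC_ALL=C sort
}
```

It could be tried on a single ORF file with, for example, `codon_usage < Veillonella_parvula_ATCC_17745_prodigal.orf.fna`.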

The output of the visual analysis is a PDF file along with a raw data file, in the format shown below:

    Veillonella_parvula_ATCC_17745_prodigal.orf.fna TotalBases: 1900137 PerAT: 60.38 StDevAT: 0.04
    codon AAA 4.39974 27867
    codon CAA 2.79548 17706
    .........
    aa C 0.9828
    aa P 3.6291

The analysis should be performed in a directory which has a file called .orf.fna, and is run as follows:

    basicGenomeAnalysis organismName /usr/bin/gnuplot

It is also possible to run only the calculations, without the visual presentation. This is more useful for comparing many genomes.

    for i in *fna
    do
        # AT statistics go first, followed by codon and amino acid usage
        perl /usr/biotools/indirect/atStats.pl "$i" > "$i.atStats.tab"
        cat "$i.atStats.tab" > "$i.CodonAaUsage"
        perl /usr/biotools/indirect/CodonAaUsage.pl "$i" >> "$i.CodonAaUsage"
        rm "$i.atStats.tab"
    done

To collect the data for all genomes, construct one file per type of data (amino acid usage, codon usage and statistics):

    # amino acid usage (pattern anchored so that only lines starting with "aa" match)
    grep '^aa' *AaUsage > aaUsage.all
    sed -i 's/_prodigal.orf.fna.CodonAaUsage:aa//g' aaUsage.all

    # statistics
    grep Total *AaUsage > statistics.all
    sed -i 's/_prodigal.orf.fna.CodonAaUsage:/\t/g' statistics.all
    cut -f2-8 statistics.all > tmp.all
    mv tmp.all statistics.all
    sed -i 's/_prodigal.orf.fna//g' statistics.all

    # codon usage
    grep '^codon' *AaUsage > codonUsage.all
    sed -i 's/_prodigal.orf.fna.CodonAaUsage:codon//g' codonUsage.all

Questions

*Use head and tail to have a look at these files. What do they contain?

These files can be used to plot several different patterns using Excel, R or other plotting programs. What the data can be used for depends on the aim of the study and cannot be standardized. The approach shown here compares XX of all genomes and graphically represents the results as a XX plot. To see how these plots were made, go to the R page: http://biotoolscmg.wikia.com/wiki/R#Reshape_table_to_matrix_.28heatmap.29
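Plotting programs usually expect one row per genome and one column per codon or amino acid, whereas the collected files hold one measurement per line. As a rough illustration of that reshaping step, the sketch below turns a long table into a tab-separated matrix. The function name long_to_matrix is hypothetical, and the sketch assumes whitespace-separated input lines of the form "genome key value" (any further columns, such as raw counts, are ignored); check the real layout of codonUsage.all and aaUsage.all with head before relying on it.

```shell
# long_to_matrix: hypothetical helper, not part of the biotools scripts.
# Input (stdin):  <genome> <key> <value> [ignored columns...]
# Output:         tab-separated matrix, genomes as rows, keys as columns,
#                 "NA" where a genome has no value for a key.
long_to_matrix() {
  awk '
    NF < 3 { next }                      # skip blank or incomplete lines
    {
      # Remember genomes and keys in order of first appearance.
      if (!($1 in seen_row)) { seen_row[$1] = 1; rows[++nr] = $1 }
      if (!($2 in seen_col)) { seen_col[$2] = 1; cols[++nc] = $2 }
      value[$1, $2] = $3
    }
    END {
      printf "genome"
      for (j = 1; j <= nc; j++) printf "\t%s", cols[j]
      printf "\n"
      for (i = 1; i <= nr; i++) {
        printf "%s", rows[i]
        for (j = 1; j <= nc; j++) {
          present = ((rows[i], cols[j]) in value)
          printf "\t%s", (present ? value[rows[i], cols[j]] : "NA")
        }
        printf "\n"
      }
    }
  '
}
```

A possible invocation would be `long_to_matrix < codonUsage.all > codonUsage.matrix.tab`, where the output file name is only an example; the resulting table can then be read into R or Excel.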